A Scalable RDF Data Processing Framework based on Pig and Hadoop
Abstract
Handling the growing volume of available RDF data effectively requires scalable and flexible RDF data processing frameworks. Emerging Big Data technologies have become available, such as Hadoop-based systems that take advantage of scalable, fault-tolerant distributed processing modeled on Google’s distributed file system and the MapReduce parallel model, but many issues remain when applying these technologies to RDF data processing. In this paper, we propose an RDF data processing framework built on Pig and Hadoop, with several extensions that address these issues. We integrate an efficient RDF storage schema into the framework and show the performance improvement over Pig’s standard bulk load and store operations, including the cost of converting conventional RDF file formats to the schema. We also compare the performance of our framework with that of existing single-node RDF databases. Furthermore, because reasoning is an important requirement for most RDF data processing systems, we introduce a user operation for inferring new triples with entailment rules and evaluate its performance on our framework, using the transitive closure operation as an example of such inference.
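As a rough illustration of the transitive closure inference mentioned in the abstract, the following Python sketch applies a single transitivity rule to a small in-memory set of triples. It is not the paper's Pig and Hadoop implementation: the function name `transitive_closure`, the prefixed predicate string `rdfs:subClassOf`, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict

# Assumed predicate for the example; any transitive property would work.
SUBCLASS_OF = "rdfs:subClassOf"


def transitive_closure(triples, predicate=SUBCLASS_OF):
    """Return the input triples plus every triple implied by the rule
    (a p b), (b p c) => (a p c) for the given predicate."""
    # Keep only the (subject, object) pairs that use the transitive predicate.
    edges = {(s, o) for s, p, o in triples if p == predicate}

    # Index direct successors so each join step is a dictionary lookup.
    successors = defaultdict(set)
    for s, o in edges:
        successors[s].add(o)

    closure = set(edges)
    delta = set(edges)  # pairs discovered in the previous round
    while delta:
        new_pairs = set()
        for s, o in delta:
            # Extend each newly found path by one direct edge.
            for o2 in successors.get(o, ()):
                if (s, o2) not in closure:
                    new_pairs.add((s, o2))
        closure |= new_pairs
        delta = new_pairs  # stop once a round adds nothing new

    return set(triples) | {(s, predicate, o) for s, o in closure}


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the paper.
    data = [
        ("A", SUBCLASS_OF, "B"),
        ("B", SUBCLASS_OF, "C"),
        ("C", SUBCLASS_OF, "D"),
        ("A", "rdf:type", "X"),
    ]
    inferred = transitive_closure(data) - set(data)
    for triple in sorted(inferred):
        print(triple)
```

Each round joins only the pairs found in the previous round against the direct edges, so the loop ends as soon as a round produces nothing new. On a cluster the same join would presumably be repeated as distributed jobs, but that mapping is not described in the abstract.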
Similar resources
2016 Olympic Games on Twitter: Sentiment Analysis of Sports Fans Tweets using Big Data Framework
Big data analytics is one of the most important subjects in computer science. Today, due to the increasing expansion of Web technology, a large amount of data is available to researchers. Extracting information from these data is one of the requirements for many organizations and business centers. In recent years, the massive amount of Twitter's social networking data has become a platform for ...
Towards a distributed, scalable and real-time RDF Stream Processing engine
Due to the growing need to timely process and derive valuable information and knowledge from data produced in the Semantic Web, RDF stream processing (RSP) has emerged as an important research domain. Of course, modern RSP have to address the volume and velocity characteristics encountered in the Big Data era. This comes at the price of designing high throughput, low latency, fault tolerant, hi...
Efficient and scalable processing of frequent SPARQL queries
A large amount of data is being published today in the RDF [6] framework using semantic mark-up. The number of triples currently published on the web is approximately 62 billion from 870 datasets [4]. We need frameworks that can query RDF data efficiently when the data sizes scale up. Incremental addition of data and computing resources is a fundamental aspect of cloud computing. We propose a f...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
In this paper we discuss PigSPARQL, a competitive yet easy-to-use SPARQL query processing system on MapReduce that allows ad hoc SPARQL query processing on large RDF graphs out of the box. Instead of a direct mapping, PigSPARQL uses the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce. This additional level of abstr...